Most systems on chip (SoCs) use a network on chip (NoC) as the routing fabric for node-to-node data transfer with minimal power consumption, low latency, and high throughput. This paper concentrates on modelling asynchronous NoCs with asynchronous circuits on field programmable gate arrays (FPGAs). A 3x3 NoC and its universal asynchronous receiver transmitter (UART) protocol are designed; the Verilog hardware description language (HDL) code is simulated and tested on the Artix-7 FPGA kit, and the testing is done using the Chipscope tool. To meet the target requirements in terms of power consumption and latency, the label switching (LS) technique is used for routing. The proposed LS-NoC with the level-encoded dual-rail (LEDR) encoding technique registers the packet between the different routers, which helps to improve throughput and speed. The effectiveness of the data transfer is measured and analyzed through a synthesis summary in terms of lookup tables (LUTs), slice registers, flip-flops (FFs), latency, and packet delivery ratio (PDR) for the traffic pattern generator. The proposed NoC is designed for 8x8, and each port is 21 bits wide, including the IDs of the source and destination routers. The results show an improvement of about 12% in LUTs and 7% in flip-flops, a 23% improvement in throughput, and a 26% reduction in delay.

Keywords: Network on chip, NoC manager, UART

1. INTRODUCTION

Network on chip (NoC) is one of the developing fields within the system on chip (SoC) research area. Several papers have been published on the development of NoC systems and their applications [1]. In the meantime, there have been several improvements and developments in the structure of synchronous and asynchronous NoCs [2]. The message-passing asynchronous NoC providing guaranteed services over open core protocol (OCP) interfaces (MANGO) has been developed into a fully grown high-speed NoC [3], [4]. The favorable services offered by MANGO are bounded services [5], [6]. The OCP interface connects the NoC to its associated core. The guaranteed-service (GS) network and the best-effort (BE) network are the two main components of the NoC [7], [8]. Virtual channels support the connection-oriented GS services, which are characterized by hard latency guarantees that promise better utilization. In the BE network, packets are routed by wormhole routers [9]. In earlier research, the implementation of asynchronous circuits on field programmable gate arrays (FPGAs) has been very narrow and confined [10], [11]. We therefore aim for a more sophisticated approach that makes this implementation easier. The NoC was first developed as a best-effort NoC in the elementary asynchronous mode [12]. A router, a master network adapter (NA), and a slave NA are the three components of the NoC. The routers are interconnected in a mesh topology, and wormhole routing is used for communication. Source routing and XY routing are used to avoid deadlocks [13]. The number of flits in a packet can be unlimited [14].
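
As an illustration of XY routing on a mesh, the following minimal Python sketch moves a packet along the X dimension first and then along the Y dimension. The function name `xy_route` and the (x, y) coordinate convention are assumptions made for this example; they are not taken from the proposed design.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited from src to dst.

    Dimension-ordered routing: all X hops are taken first, then all Y hops.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # travel along X until the column matches
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then travel along Y until the row matches
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


# Route across a 3x3 mesh from the corner (0, 0) to router (2, 1).
print(xy_route((0, 0), (2, 1)))  # prints [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every packet finishes its X hops before turning into Y, the turns that would close a cyclic channel dependency never occur, which is why XY routing on a mesh avoids deadlock without virtual channels.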

The four basic units of an NoC are intellectual property (IP) cores, NAs, routers (R0 to R8), and links. Figure 1 shows the outline of a 3x3 NoC module. An NoC allows the cores to communicate with each other easily and accurately [15]. The FPGA implementation has so far been treated mostly theoretically, so the BE NoC is the design chosen for implementation. The primary concern of this work is availability, and the least prioritized issue is execution [16]. The area of the complete structure is low as a result of the logic resources accessible on the given FPGA [17]. The next part of this work presents the model for selecting the NoC design [18]. The selected topology should suit the constraints imposed by the FPGA. The topology conditions to be considered are listed in [19]. The link to the next node may be one-directional, as in a torus, or two-way, as in a k-ary 2-cube mesh. At the selection stage, a k-ary 2-cube network with two-way links is selected. The basic reason for this selection is freedom from deadlock, whereas the torus requires a large number of links [20]. If the topology is combined with XY routing, deadlocks can be removed without virtual channels, and the two-dimensional topology maps well onto the FPGA architecture. The further needs of a k-ary 2-cube network topology are: there are four ports for network connections, one port for a core affiliation, and p described in [21].

Figure 1. General structure of NoC connected in a 3-by-3 mesh topology


2. PROPOSED LOW POWER ROUTER DESIGN FOR LABEL SWITCHING NoCs

The label switching (LS) technique is used in many networks, such as asynchronous transfer mode (ATM) networks, since it depends purely on packet relaying: LS carries the route information within the network in the form of labels. Another function of LS is to change the direction in the X-Y coordinates when a packet moves from one route to another: it identifies the next router from the forwarding information, quality of service (QoS) guarantee, and traffic priority, and finally assigns the next route label [22]. LS is applied to the transmission of streaming data, at the cost of more area and high power consumption. The microarchitecture of a single router is shown in Figure 2; it consists of a first in first out (FIFO) buffer and its control block, the NoC manager, a crossbar switch, and an arbiter. The proposed work concentrates mainly on reducing power using the bit transition encoder and decoder (BTED) technique, as shown in (3) and (4). The existing LS-based NoC targets streaming applications, where latency and hardware utilization are limiting factors. These limits are addressed in this work with the help of a NoC manager that monitors and controls bandwidth sharing and adjusts it automatically. The NoC manager uses a flow graph (FG) to represent the communication between source and destination nodes. It stores the packets and updates their bandwidths in a table known as the routing table. The source router in the FG processes the packet generated by the traffic generator; the processing engine handles it through the input and output ports. The engine and the input and output ports receive the data and form the sink nodes in the FG. From source to destination, the intermediate nodes are represented as edges and stored in the FG, which is given in Figure 3; its edges are shown in Table 1.
$E_{ij} = \{N_i, N_j, u_{ij}, A_{ij}, LP_{ij}^{used}, LP_{ij}^{unused}\}$ (1)

where $E_{ij}$ is the edge connecting the source (s) and the destination (d), which can be any of the 64 nodes. $N_i$ is the source node, $N_j$ is the destination node, $u_{ij}$ is the utilized node (already used by another router), $A_{ij}$ is the available node (or free node in the FG), $LP_{ij}^{used}$ is the node present in the list of labels used in the pipe through $E_{ij}$, and $LP_{ij}^{unused}$ is the node present in the FG that is not used by any other router.

During transmission of the packet, the $LP_{ij}^{used}$ entries are equal to "NULL" when no data is available; once their capacity or bandwidth is completely utilized, they are not available to serve any other router for data transmission. During the transmission of data, $A_{ij}$ has maximum capacity or bandwidth, and $LP_{ij}^{unused}$ is not used by any other router, so it is free to serve and available for data transmission. For effective data transmission, the pipe should have maximum capacity 'c', and it establishes communication between the source (s) and the destination (d).

Figure 2. Proposed label-switched microarchitecture of a single router with single-cycle flit traversal and its internal micro blocks, including power optimization using BTED


Table 1. Routing table of the LS-NoC and BTED-based NoC for the routes from 33 to 16 and from 17 to 5

E_ij     N_i   N_j   u_ij   A_ij   LP_ij^used   LP_ij^unused
25,26    1     6     10     0      {0}          {0..64}
27,28    2     8     10     0      {0}          {0..64}
33,34    3     4     0      10     ∅            {1..64}
34,35    4     5     0      10     ∅            {1..64}
36,37    5     7     0      10     ∅            {1..64}
38,39    6     9     0      10     ∅            {1..64}
30,31    7     1     10     0      {0}          {0..64}
39,40    8     1     10     0      {0}          {1..64}
40,32    9     3     10     0      {0}          {1..64}
32,24    10    5     10     0      {0}          {1..64}
24,16    11    6     10     0      {0}          {1..64}


In the proposed design, the major sub-systems are the routers, network adaptors, switching algorithm, label-based routing technique, and power optimization method. They are integrated at the SoC level to meet the IP requirements with optimal power, area, latency, and throughput. All these sub-systems are part of the NoC, which is integrated into the SoC for interfacing with high-speed Cortex-M33 processors and other controllers through different protocols [23]. In Algorithm 1, the first step is FG creation: the input packet includes both the information data and the destination ID as labels, and the FG contains the edges and their capacities [24]. All edges in the FG change their direction from $E_{ij}$ to $E_{ji}$ based on the next router, which depends on the destination node. The FG also stores the capacity of every link (path). The third step is to monitor the number of packets transmitted and received between routers. In the fourth step, the data is stored in the output ports when the source and destination nodes are the same. The remaining steps perform the data transmission based on the bandwidth available at every node, and finally the received data is stored in the pipe stack (ps). Once the packet reaches the destination, the FG updates its edges, pushes the packet data to the output ports, and also stores it in the stack pointer (sp). After the destination node and pipe are identified, the sp is used to update the used ($LP_{ij}^{used}$) and available ($LP_{ij}^{unused}$) label lists based on capacity or bandwidth. Once the FG is updated, the NoC manager configures the routing table, and label updating is performed in every intermediate router. Along with the FG update, the $LP_{ij}^{used}$ of each edge is checked for conflicts; if there is a conflict, the NoC manager identifies an alternative port or router that is unused in the routing table. The table data structure at a node can be written as shown in (2):


$N_i, P_{old} = N_j, E_{new}$ (2)
where $P_{old}$ is the pipe label in the edge ending at $N_i$ and $E_{new}$ is the pipe label in the edge in $E_{ij}$.


Algorithm 1. NoC manager: identification of pipe (ps: pipe stack)
Define the source node as the pipe in the NoC
Input: $E_{ij} = \{N_i, N_j, u_{ij}, A_{ij}, LP_{ij}^{used}, LP_{ij}^{unused}\}$, s, d and c
Define flow graph FG = $\{E_{ij}\}$ = $\{N_i, N_j, u_{ij}, A_{ij}, LP_{ij}^{used}, LP_{ij}^{unused}\}$
Initialize the counter value to 0, i.e., k = 0
If s = d then
    Do not perform any processing; store the source packet in the output ports of the same router and update ps = s{data}
If s != d then
    For all edges starting from s, s+1, ... to d do
        If $A_{ij}$ > c then {the available node capacity is greater than c}
            Data_out = data packet; send it to the next router input port, update the FG based on the node ID, and then push the data into sp
        If $A_{ij}$ < c then
            Search for an alternative router and its free input and output ports, and then push into sp
        If d = destination ID then
            ps = d{data}; update the FG and extract the data packet by removing the label bits
End
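
The pipe identification in Algorithm 1 can be sketched in Python as follows. This is only an illustrative reading of the algorithm under stated assumptions: the FG is modelled as a dictionary keyed by directed edges, the "search for an alternative router" step is realised with a breadth-first search (the algorithm itself does not specify the search), and an edge is treated as usable when its available capacity is at least c. The names `find_pipe`, `"A"`, `"u"` and `"lp_used"` are hypothetical.

```python
from collections import deque


def find_pipe(fg, s, d, c, label):
    """Return a list of nodes forming a pipe from s to d, or None.

    fg maps a directed edge (n_i, n_j) to a dict with the available
    capacity "A", the utilized capacity "u" and the set "lp_used" of
    pipe labels already routed through that edge.
    """
    if s == d:
        return [s]                      # packet stays in the same router
    parent = {s: None}
    queue = deque([s])
    while queue:                        # breadth-first search (assumed)
        node = queue.popleft()
        if node == d:
            break
        for (n_i, n_j), edge in fg.items():
            # only follow edges that still offer capacity c
            if n_i == node and n_j not in parent and edge["A"] >= c:
                parent[n_j] = node
                queue.append(n_j)
    if d not in parent:
        return None                     # no pipe with enough capacity
    path = []
    node = d
    while node is not None:             # walk back from d to s
        path.append(node)
        node = parent[node]
    path.reverse()
    for link in zip(path, path[1:]):    # update the FG along the pipe
        fg[link]["A"] -= c
        fg[link]["u"] += c
        fg[link]["lp_used"].add(label)
    return path


fg = {
    (1, 2): {"A": 10, "u": 0, "lp_used": set()},
    (2, 3): {"A": 0, "u": 10, "lp_used": {"p9"}},   # fully utilized link
    (1, 4): {"A": 10, "u": 0, "lp_used": set()},
    (4, 3): {"A": 10, "u": 0, "lp_used": set()},
}
print(find_pipe(fg, 1, 3, 5, "p0"))  # prints [1, 4, 3]
```

In this toy graph the direct route through node 2 is rejected because link (2, 3) has no available capacity, so the pipe is formed through node 4 and the capacities of the two links it uses are reduced by c.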


Figure 3 shows the 8x8 LS-NoC in a 2D mesh topology. The NoC manager is part of every router, and each router has five input and output ports (east, north, south, west, and local) and processing elements along with IP blocks that store the received packet at the destination node. The single LS-based router is designed using combinational circuits between the input and output ports. The data received from the source system, i.e., the device generating electrocardiogram (ECG) signals, is stored in the FIFO if other flits are awaiting traversal or if the arbiter does not grant access to the output port [25]. The FIFO control block (FCB) takes care of the FIFO pointer arithmetic and controls the signal flow of the corresponding input port.




Figure 3. FG of the LS-NoC and BTED architectures during data packet transmission with two sources and 
two destinations marked as red and green 


The FIFO-based router design of the LS-NoC can handle multiple asynchronous clock domains. Multi-clock buffers can be used in place of the buffers at the router's output port, and they can be linked to dual-clock interfaces. Because of the nature of the pipe formation process, the LS-NoC has built-in fault tolerance. Following the detection of a defective connection, the LS-NoC manager takes the following steps:
- When a bad connection is detected, the LS-NoC manager sets the capacity of that link to 0 in the FG.

- Existing pipes that use the connection are deactivated. The pipes are renamed, and the routing tables are modified.

- After the pipes are configured, the FG is updated.
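
The first two fault-handling steps can be sketched as below. This is a minimal illustration, assuming the FG capacities are held in a dictionary keyed by directed links and each pipe is stored as a list of nodes; the function name `handle_link_fault` and both data structures are hypothetical, and re-forming the deactivated pipes is left to the pipe identification procedure.

```python
def handle_link_fault(capacity, pipes, bad_link):
    """Zero the faulty link in the FG and deactivate the pipes that use it.

    capacity maps a directed link (n_i, n_j) to its capacity in the FG.
    pipes maps a pipe label to the list of nodes the pipe traverses.
    Returns the labels of the deactivated pipes so they can be re-formed.
    """
    capacity[bad_link] = 0                       # step 1: capacity set to 0
    affected = [
        label
        for label, path in pipes.items()
        if bad_link in zip(path, path[1:])       # pipe crosses the bad link
    ]
    for label in affected:                       # step 2: deactivate pipes
        del pipes[label]
    return affected


capacity = {(1, 2): 10, (2, 3): 10, (1, 4): 10}
pipes = {"p0": [1, 2, 3], "p1": [1, 4]}
print(handle_link_fault(capacity, pipes, (2, 3)))  # prints ['p0']
```

Setting the capacity to 0 rather than deleting the edge keeps the FG structure intact, so a later search for a replacement pipe simply skips the faulty link.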

The NoC manager's overhead is made up of two parts: computation and configuration. Identifying a 
pipe with a flow-based method (Algorithm 1) incurs computational cost. Routing table configuration is 
transmitted across the network and routing tables are updated as part of the configuration overhead (Table 1). 



3. BIT TRANSITION ENCODER/DECODER FOR POWER OPTIMIZATION IN NoC 

Power consumption and its optimization are a major challenge in NoCs, since high power consumption degrades performance. In this work, as shown in Figures 4 and 5, the bit transition encoder technique is applied before the packet is transmitted to the source router, and the decoder is applied after the packet is received at the destination router. In any on-chip memory or network, power consumption depends on the number of bit transitions: bit 1 to bit 0 (formally known as type 1) or bit 0 to bit 1 (formally known as type 2). There is no transition if both bits are the same: bit 0 to bit 0 (formally known as type 3) or bit 1 to bit 1 (formally known as type 4). The power reduction technique works only on type 1 and type 2. Power optimization is based purely on the number of bit transitions in the packet data: if there are many transitions, the encoding technique minimizes them before the packet is sent to the next router. The generalized logical expressions for encoding are given in (3) and (4).


$Encoded_i = data_i \oplus data_{i-1} \oplus FI$, for all odd numbers $i$ (3)
$Encoded_i = data_i \oplus data_{i-1} \oplus FI \oplus HI$, for all even numbers $i$ (4)


where FI is the full-invert bit, which can be either 1 or 0, HI is the half-invert bit, whose value is the same as FI, $data_i$ is the present bit in the given packet, and $data_{i-1}$ is the previous bit in the given packet. The XOR operation is performed between these two bits to reduce the number of transitions. For example, consider a 16-bit packet, 1010101010101010, which has 15 transitions. After the bit transition encoder is applied to the packet through the XOR operation, the encoded bits are 1111111111111111, as shown in Figure 4. The encoded bits have 0 transitions, so the number of transitions is reduced from 15 to 0.
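
The worked example can be reproduced with a short Python sketch of the encoder in (3) and (4). It is a sketch under assumptions that the text does not fix: bit positions are numbered from 1, the bit preceding the first data bit is taken as 0, and FI and HI are both 0 for this packet. The function names are hypothetical.

```python
def count_transitions(bits):
    """Count adjacent bit pairs that differ (type 1 and type 2 transitions)."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)


def bted_encode(data, fi=0, hi=0, prev=0):
    """Encode a list of bits following (3) and (4).

    Assumptions: positions start at 1 and `prev` is the bit that
    precedes the first data bit.
    """
    encoded = []
    for i, bit in enumerate(data, start=1):
        out = bit ^ prev ^ fi          # (3): applied at every position
        if i % 2 == 0:
            out ^= hi                  # (4): even positions also XOR HI
        encoded.append(out)
        prev = bit                     # becomes data_{i-1} for the next bit
    return encoded


packet = [1, 0] * 8                    # 1010101010101010
encoded = bted_encode(packet)
print(count_transitions(packet), count_transitions(encoded))  # prints "15 0"
```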




Figure 4. Simulated results of power reduction through bit transition encoder technique, here input is 16 bits 
(1010101010101010) and output is 16 bits (1111111111111111) 


The encoded packet is transmitted from the source router to the destination router. At the destination router, after the packet is received, the bit transition decoder is applied to recover the original packet. The generalized logical expressions for decoding are given in (5) and (6):

$Decode_i = RC_{data_i} \oplus RC_{data_{i-1}} \oplus FI$, for odd bits of $i$ (5)

$Decode_i = RC_{data_i} \oplus RC_{data_{i-1}} \oplus FI \oplus HI$, for even bits of $i$ (6)

After applying (5) and (6), the simulated results are shown in Figure 5; the decoded packet bits are the same as the packet bits transmitted at the source node.

Figure 5. Simulated results of power reduction through bit transition decoder technique, here input is 16 bits
(1111111111111111) and output is 16 bits (1010101010101010)
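
A matching decoder sketch in Python is given below. It assumes that the previous-bit term in (5) and (6) refers to the previously recovered bit, which is what makes the decoder the inverse of the encoder; it also assumes positions numbered from 1, an initial previous bit of 0, and FI and HI known at the receiver. The function name `bted_decode` is hypothetical.

```python
def bted_decode(received, fi=0, hi=0, prev=0):
    """Recover the original bits from a received encoded packet.

    Assumption: the previous-bit term is the previously *decoded* bit,
    so each step undoes the XOR applied by the encoder.
    """
    decoded = []
    for i, bit in enumerate(received, start=1):
        out = bit ^ prev ^ fi          # odd and even positions
        if i % 2 == 0:
            out ^= hi                  # even positions also XOR HI
        decoded.append(out)
        prev = out                     # feed the recovered bit forward
    return decoded


received = [1] * 16                    # 1111111111111111
print("".join(str(b) for b in bted_decode(received)))  # prints "1010101010101010"
```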


4. RESULTS AND DISCUSSIONS 

The proposed LS-NoC with the power optimization technique is synthesized using the Xilinx ISE Design Suite 14.7 software tool and implemented on an Artix-7 FPGA development kit. The delay, throughput, and figure of merit are analyzed between the source (R00) and destination (R05) nodes through the simulated results shown in Figure 6. To verify the latencies between different routers, the first router is always taken as the source router and the others as destination routers; the latency is measured from the source to each of the other routers, as shown in Table 2. The second column of Table 2 shows the different latencies; for example, 10 and 16 are the delays, in ns, for routers 3 and 6 (shown in the destination node column). The throughput and frequencies are shown in Table 2 in the same way. In Figure 6, the first signal is the 100 MHz clock, followed by the input and output data of the source and destination routers, which are highlighted in a separate box.

Figure 6. Simulated results of 3x3 LS-NoC and their received data at each input and output ports


Throughput (thp): The throughput of the proposed 8x8 NoC system is calculated as the ratio of the total number of bits transmitted to the total time taken, in seconds, and is given by (7).

$thp = \frac{n_{pck} \times packet_{flit\_size}}{n_{simpkt} \times T}$ (7)

where $n_{pck}$ is the number of packets transmitted per cycle, $packet_{flit\_size}$ is the packet size of 16 bits (the flit size is also 16 bits), $n_{simpkt}$ is the total latency of every packet transmission, and $T$ is the total cycle period.
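
As a small numeric illustration of (7), the Python sketch below evaluates the throughput for one set of inputs. The input values are hypothetical and chosen only to show the arithmetic; they are not measurements from the proposed design, and the function and parameter names are assumptions.

```python
def throughput(n_pck, packet_bits, n_simpkt, cycle_period_s):
    """Evaluate (7): transmitted bits divided by (latency cycles x cycle period)."""
    return (n_pck * packet_bits) / (n_simpkt * cycle_period_s)


# Hypothetical inputs: one 16-bit packet, 4 cycles of latency, 10 ns cycle period.
thp = throughput(n_pck=1, packet_bits=16, n_simpkt=4, cycle_period_s=10e-9)
print(f"{thp / 1e6:.1f} Mbit/s")
```

With these inputs, 16 bits are delivered in 40 ns, which corresponds to about 400 Mbit/s.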

The hardware platform for implementing the NoC in the proposed system is the Xilinx Design Suite, which was also used by the prior systems. Table 3 compares the summary of previous work with the proposed work. The relative plots and a thorough analysis of the NoC with and without the power optimization technique are shown in Figures 7 and 8.

As a result, when compared with the existing work, the proposed system performs better in all parameters, whether or not LS is used to store and then transmit ECG signals, as shown in Figure 9. The proposed LS-NoC is extended from 3x3 to 8x8, with a total of 64 routers, to analyse latency and routing paths; the simulated results of the 3x3 NoC are shown in Figure 10 for source node 3 and destination node 6.


Table 2. Source and destination router

Parameter                      Source router/value of delay/TP/latency   Destination node
Node                           R00 to
Delay (D) in ns                10, 16 (as per Figure 6)                   R03, R06
Delay (D) in ns                10, 16, 40 (as per Figure 7)               R03, R04, R06
Delay (D) in ns                10, 16, 35, 40 (as per Figure 8)           R03, R04, R05, R06
Throughput (TP) in MHz         6, 4, 2.2, 5.9                             R03, R06
TP in MHz                      3.1, 3.2, 3.1, 3.4                         R03, R04, R06
TP in MHz                      4.1, 4.6, 4.9                              R03, R04, R05, R06
Latency (L) in picoseconds     0.9, 0.4, 2.1, 4.9                         R03, R06
L in picoseconds               4, 3, 3.1, 5.9                             R03, R04, R06
L in picoseconds               3.1, 3.3, 3.5, 3.5                         R03, R04, R05, R06


Table 3. Summary of the existing work with proposed work

                     Without ECG signals                         With ECG signals
Parameter            Existing work [26], [27]   Proposed work    Existing work   Proposed work
Slice registers      3930                       2399             3634            2864
Slice LUTs           4671                       2590             4090            2972
Slice FFs            2495                       1855             3635            2843
Delay (ns)           5                          3.024            19.285          14.4
Power (mW)           43.06                      8.2              214             82
Area                 16854                      3694             12353           2529
Frequency (MHz)      312.69                     391.850          75.27           450.85
Throughput (Gbps)    80                         260.8            ---             170.8




Figure 7. Delay calculation between source and destination of 3x3 NoC; intermediate nodes are 3 and 6; node 3 receives data at 10 ns, node 4 receives data at 40 ns, and node 6 receives data at 16 ns

Figure 8. Delay calculation between source and destination of 3x3 NoC; intermediate nodes are 3, 4, 5 and 6; node 5 receives data at 35 ns and node 6 receives data at 40 ns

Figure 10. Delay calculation between source and destination of 3x3 NoC; intermediate nodes are 3 and 6; node 3 receives data at 10 ns and node 6 receives data at 16 ns


To analyse the latency in detail, we consider source ID 01 and destination ID 15, as shown in Figure 11. A delay of 1 ns is found for routers 01 to 03 and a delay of 3 ns for routers 04 to 15. The packet is transmitted from source 01 through 02, 06, 08, 09, 10, 11, 12, 13, and 14 to 15, as shown in Table 2. Because of congestion and contention caused by conflicts, the packet does not travel through 03, 04, 05, and 07, which is clearly observed in the simulated results shown in Figure 12.

Figure 12. Simulated results of LS-NoC with LEDR encoding technique


The test bench of the top-level design includes both the LS-NoC and the LEDR. Its implementation includes a 6-rail voltage with full functionality, and the inputs for the LEDR come from text files in which the voltage levels are specified; these are continuously looped through during simulation, as shown in Figure 10. Because of this, the VRAIL_EN signal has no effect on the simulated analog input (voltage): the analog input will not rise when VRAIL_EN is asserted, nor will it fall when VRAIL_EN is de-asserted, as shown in Figure 11.

5. CONCLUSION

The dynamically reconfigurable network on chip (DRNoC) uses a mesh topology and XY routing with deadlock freedom to minimize latency. The streaming data are converted into packets, which consist of the source and destination IDs and the flit bits. These packets are encoded by adding two additional request signals, similar to handshake signals. The proposed design has asynchronous clocks, and a synchronizer is used to manage synchronization. The router has 1302 LUTs and 530 latches, with the delay elements using 12% of the LUTs. The overall measured output of the router was 46 MHz. Three CPUs and three external units make up the prototype, which is connected via a 3x2 mesh. The power and area used by the router buffers in an NoC are a major issue in the deep submicron domain, which motivates the elimination of buffers. Compared with another conventional bufferless routing algorithm, the computational results demonstrate that the designed routing algorithm improves average latency by 22%, power consumption by 21%, and area overhead by 44%. An 8x8 switch router with a suitable shortest-path detector, such as a minimal spanning tree, is used to design the proposed network architecture for effective run-time routing. The design is written in the Verilog hardware description language (HDL), executed in Xilinx Vivado 2018.1, and implemented on a Nexys 4 DDR board with an Artix-7 FPGA (part number XC7A100T-CSG324, which has 324 pins), with improved accuracy and a 35% improvement in latency. Compared with the conventional router, the proposed router increases efficiency by 40%, and the technique outperforms the traditional one in terms of delay, area, and resource allocation.